2023-12-02
Student ID: U2016609
Tabula statement
We’re part of an academic community at Warwick. Whether studying,
teaching, or researching, we’re all taking part in an expert
conversation which must meet standards of academic integrity. When we
all meet these standards, we can take pride in our own academic
achievements, as individuals and as an academic community. Academic
integrity means committing to honesty in academic work, giving credit
where we’ve used others’ ideas and being proud of our own achievements.
In submitting my work I confirm that:
1. I have read the guidance on academic integrity provided in the Student Handbook and understand the University regulations in relation to Academic Integrity. I am aware of the potential consequences of Academic Misconduct.
2. I declare that the work is all my own, except where I have stated otherwise.
3. No substantial part(s) of the work submitted here has also been submitted by me in other credit bearing assessments courses of study (other than in certain cases of a resubmission of a piece of work), and I acknowledge that if this has been done this may lead to an appropriate sanction.
4. Where a generative Artificial Intelligence such as ChatGPT has been used I confirm I have abided by both the University guidance and specific requirements as set out in the Student Handbook and the Assessment brief. I have clearly acknowledged the use of any generative Artificial Intelligence in my submission, my reasoning for using it and which generative AI (or AIs) I have used. Except where indicated the work is otherwise entirely my own.
5. I understand that should this piece of work raise concerns requiring investigation in relation to any of points above, it is possible that other work I have submitted for assessment will be checked, even if marks (provisional or confirmed) have been published.
6. Where a proof-reader, paid or unpaid was used, I confirm that the proofreader was made aware of and has complied with the University’s proofreading policy.
7. I consent that my work may be submitted to Turnitin or other analytical technology. I understand the use of this service (or similar), along with other methods of maintaining the integrity of the academic process, will help the University uphold academic standards and assessment fairness.
Privacy statement
The data on this form relates to your submission of coursework. The date
and time of your submission, your identity, and the work you have
submitted will be stored. We will only use this data to administer and
record your coursework submission.
Related articles Reg. 11
Academic Integrity (from 4 Oct 2021)
Guidance on Regulation 11 Proofreading Policy
Education Policy and Quality Team Academic Integrity
(warwick.ac.uk)
The internet enables individuals to share their opinions by posting text on various platforms. It becomes crucial for businesses to analyse users’ reviews since it can improve their service quality based on the analysis by establishing relevant strategies.
This paper will base on the OSEMN framework (Obtain, Scrub, Explore,
Model, and iNterpret) to predict Yelp users’ ratings on businesses.
OSEMN is a data science methodology which prioritizes a direct route to
model analysis with efficiency and less preliminary business
understanding process (Saltz, Sutherland and Hotz 2022). The focus is to
construct and select models. Selecting model should be paid caution to
since the Yelp ratings can be considered as categorical, which are
incompatible with some linear models. Additionally, the consumers’
reviews are text contents, so the computational algorithms is needed for
model construction. OSEMN framework satisfies this priority.Thus, this
paper uses classification random forest models with text mining to
predict users’ reviews on Yelp.
The original datasets are available on the Yelp website (Yelp, 2019).
It contains ratings and other relevant information about users and
businesses. Some sample data are randomly drawn for quicker processing.
After taking the overlap and merging three of the original datasets,
dropping all the missing values and the variables extremely skewed, a
sample size of 279,878 data was generated.
To capture the features of the variables that will be included in the
model, summary statistics is shown as follows:
| variable | Min | Max | Median | Mean |
|---|---|---|---|---|
| stars.y | 1 | 5 | 4 | 3.75 |
| funny.y | 0 | 346 | 0 | 0.33 |
| useful.y | 0 | 1182 | 0 | 1.18 |
| cool.y | 0 | 400 | 0 | 0.50 |
| average_stars | 1 | 5 | 3.88 | 3.746 |
| review_count.y | 0 | 8363 | 24 | 114.6 |
| review_count.x | 5 | 568 | 136 | 370 |
| fans | 0 | 2547 | 0 | 11.8 |
| stars.x | 1 | 5 | 4 | 3.751 |
| compliment_plain | 0 | 8974 | 0 | 30.34 |
| latitude | 27.56 | 53.65 | 38.6 | 35.89 |
| longitude | -120.1 | -74.66 | -86.18 | -89.64 |
| is_open | 0 | 1 | 1 | 0.83 |
| text | \ | \ | \ | \ |
The relevant variables used in this paper are described in the table
below:
| Variables | Description |
|---|---|
| text | users’ text comments |
| stars.y | the response variable, discrete number range from 1 to 5, means the rates users gave to the business |
| stars.x | the average stars the business got |
| average_stars | the average stars the users gave to every business |
| fans | the number of fans of the user |
| review_count.y | discrete, means the accumulate review count of the user |
| review_count.x | discrete, means the accumulate review count of the business |
| funny.x | discrete, meaning the number of users who think the comment is funny |
| useful.x | discrete, meaning the number of users who think the comment is useful |
| cool.x | discrete, meaning the number of users who think the comment is cool |
| is_open | binary, 1 represents the business is opened while 0 means it is closed |
| compliment_plain | users’ information, discrete |
| longitude | longitude of the business, continuous |
| latitude | latitude of the business, continuous |
The variables with median smaller than mean are skewed to the lower
values. In contrast, the median of stars.y is bigger than the
mean, indicating the data skews towards higher values. As can be seen in
the below pie chart, it indicates the portion of each class of stars.
Most people rated 5 and 4 stars, while a few people rated 2 and 3.
After the tokenization, a word cloud with decreasing word frequency is presented below. As can be seen, most of the words are positive or neutral.
Random forest is a supervised learning algorithm and often be used in regression and classification problems (Karthika, Murugeswari and Manoranjithem, 2019). M training samples are constructed to build multiple decision trees which will be merged together to gain a stable value with variance be averaged, where M usually equals the square roots of the sample size (Hegde and Padma, 2017). stars.y is discrete numeric data with only 5 outcomes, so the prediction should be discrete as well to obtain the precise ratings. Thus, stars.y will be treated as categorical variable in classification tree. After drawing 10,000 testing data randomly, the models below are constructed based on the rest of the training data.
A classification tree model (Class Model 1) is constructed for
predicting stars.y. Since we can deduced that stars.y
is correlated with stars.x and average_stars, these
variables will be included. Additionally, review_count.x,
review_count.y, useful.x, funny.x, cool.x, latitude, longitude, is_open,
compliment_plain and fans are included to test the
importance of the variables.
Gini index represents the node impurity. Summing up the Gini
decreases for each individual variable over all trees in the forest
gives a variable importance (Hegde and Padma, 2017). In general, the
higher value of mean decrease accuracy and mean decrease Gini, the more
important the variable is to the model. As can be seen in the graph
above, average_stars is important to explain the response
variable. To improve the model further, text mining will be used for
generating sentiment score.
Text data presents users’ attitudes, it gives implications on the ratings. Text Mining is used for exploring the attitudes. Such as sentiment analysis, a computational study of people’s opinions, emotions, and attitudes towards an entity (Medhat, Hassan and Korashy, 2014), can provide sentiment scores according to the negative and positive opinion words in the text. Based on the lexicon corpus provided by Murali (2017), the process of calculate sentiment score is as follows:
A density plot of the sentiment score is shown above. It
asymptotically follows a normal distribution with a mean of 3.89. The
minimum score is -66, while the maximum score is 57. It implies most of
the users hold positive opinions, which is in line with the
stars.y pie chart and the word cloud, indicating a relation
between score and the response variable stars.y.
Class Model 2 with 50 trees is constructed based on Class Model 1
with score included to see the impact of sentiment score. The
variable importance graph is as follows:
As can be seen, score tops the list, meaning it is the most
important variable among all.
The above Class Models’ performance table is shown as follows:
| Model | OBB error rate | variables at each split | Number of tree |
|---|---|---|---|
| Class Model 1 | 41.93% | 3 | 80 |
| Class Model 2 | 39.04% | 3 | 80 |
Some observations do not add to the boostrap sample when constructing
the model and these are referred to “out of bag data”, which are useful
for estimating generalization error and variable importance (Prajwala,
2015). As can be seen, model with score performs better with
lower OBB error rate, indicating a higher accuracy for predicting.
Testing data is used to test the model’s performance. In terms of accuracy and kappa, class model 2 performs better, which means the correct prediction probability and agreement between sets of observations are relatively high. As can be seen in the table below, including sentiment scores improves the model’s performance by 2.44% in accuracy.
| Model | Class Model 1 | Class Model 2 |
|---|---|---|
| Accuracy | 59.73% | 62.17% |
| Kappa | 0.3794 | 0.4218 |
See the confusion matrix in the appendix, the sensitivity and accuracy for different classes diverge. For class 5, the sensitivity and balanced accuracy values are high. Take class model 2 as the case: for class 1 and 5, the model performs well. It has 82% and 86% sensitivity, 87% and 75% accuracy, and 93% and 64% specificity, respectively. But for class 2, 3, and 4, the sensitivity is 15%, 13%, and 33%, respectively, and the balanced accuracy is 56%, 55%, and 61%, respectively, which are low. The Mosaic plot of prediction and actual value is presented below:
The rectangular tile represents a combination of levels from
prediction and actual value. Tile size is proportional to the number of
predictions falling into the correct actual class. As can be seen, the
proportion of correct predictions of class 1 and 5 are high, followed by
class 4.
Sentiment analysis of text can improve the accuracy of predictions
based on stars ratings. The classification model 2 can generate the most
accurate prediction with a 62.17% probability for the testing data. For
1 and 5 stars, the true positive and negative cases are well identified
by the model, as well as the accuracy rate, while stars 2 and 3 have the
opposite situation.
The lexicon-corpus-based approach in sentiment analysis is unable to
find opinion words with domain and context-specific orientation, and
this will cause bias during the text analysis.
Although average_stars can represent the pattern of the user
when he rate the business, reverse causality may still occur between
stars.y and average_stars. Increasing stars.y
will increase average_stars, but increasing in
average_stars may not lead to increase in stars.
It is hard to determine which model fits this dataset the best. When
facing with large size of data, testing all the proposed model is time
consuming. Considering there are only 5 outcomes for stars.y, i
decided to treat it as categorical and use decision tree. However, i
tried different explaining variables in the model and trained many
times, the accuracy of the model was still below 60% with high error
rate. Then, i used sentiment analysis to assign sentiment score to the
data and it improves the model performance and finally the accuracy
reached above 60%.
The classification random forest model can perform generally well in
predicting users’ review on businesses, especially when the actual stars
ratings are 1 or 5. Sentiment score gives good implication on ratings.
More advanced natural language techniques which can find opinion words
with context-specific orientation are suggested for further study.
Hegde, Y. and Padma, S.K., 2017, January. Sentiment analysis using random forest ensemble for mobile product reviews in Kannada. In 2017 IEEE 7th international advance computing conference (IACC) (pp. 777-782). IEEE.
Karthika, P., Murugeswari, R. and Manoranjithem, R., 2019, April. Sentiment analysis of social media network using random forest algorithm. In 2019 IEEE international conference on intelligent techniques in control, optimization and signal processing (INCOS) (pp. 1-5). IEEE.
Medhat, W., Hassan, A. and Korashy, H., 2014. Sentiment analysis algorithms and applications: A survey. Ain Shams engineering journal, 5(4), pp.1093-1113.
Murali, S. (2018). Sentiment-Analysis-of-Twitter-Data-by-Lexicon-Approach. [online] GitHub. Available at: https://github.com/Surya-Murali/Sentiment-Analysis-of-Twitter-Data-by-Lexicon-Approach/tree/master [Accessed 27 Nov. 2023].
Prajwala, T.R., 2015. A comparative study on decision tree and random forest using R tool. International journal of advanced research in computer and communication engineering, 4(1), pp.196-199.
Saltz, J., Sutherland, A. and Hotz, N., 2022. Achieving Lean Data
Science Agility Via Data Driven Scrum.,Proceedings of the 55th
Hawaii International Conference on System Sciences
Yelp (2019). Yelp Dataset. [online] Available at: https://www.yelp.com/dataset.